ECCV Caption: Correcting False Negatives by Collecting Machine-and-Human-verified Image-Caption Associations for MS-COCO
Authors
Abstract
Image-Text matching (ITM) is a common task for evaluating the quality of Vision and Language (VL) models. However, existing ITM benchmarks have a significant limitation: they contain many missing correspondences, originating from the data construction process itself. For example, a caption is matched with only one image, although it could equally describe other similar images, and vice versa. To correct these massive false negatives, we construct the Extended COCO Validation (ECCV) Caption dataset by supplying the missing associations with machine and human annotators. We employ five state-of-the-art ITM models with diverse properties for our annotation process. Our dataset provides ×3.6 positive image-to-caption associations and ×8.5 caption-to-image associations compared to the original MS-COCO. We also propose to use an informative ranking-based metric, mAP@R, rather than the popular Recall@K (R@K). We re-evaluate 25 VL models on existing and proposed benchmarks. Our findings are that the existing benchmarks, such as COCO 1K R@K, COCO 5K R@K, and CxC R@1, are highly correlated with each other, while the rankings change when we shift to ECCV mAP@R. Lastly, we delve into the effect of the bias introduced by the choice of annotator. Source code is available at https://github.com/naver-ai/eccv-caption
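The mAP@R metric proposed above credits precision only at ranks where a ground-truth positive is retrieved, within the top-R ranks (R = number of positives for the query). A minimal sketch of the per-query computation, assuming a boolean relevance vector over the ranked retrieval list (the function name and array convention are illustrative, not taken from the paper's codebase):

```python
import numpy as np

def map_at_r(is_relevant: np.ndarray) -> float:
    """mAP@R for a single query.

    is_relevant: boolean array over the ranked retrieval list,
    True where the retrieved item is a ground-truth positive.
    Only the top-R ranks contribute, and precision-at-i is
    credited only at ranks where a positive appears.
    """
    r = int(is_relevant.sum())  # R = total positives for this query
    if r == 0:
        return 0.0
    top_r = is_relevant[:r]
    # precision at each rank i within the top-R window
    precisions = np.cumsum(top_r) / (np.arange(r) + 1)
    # average only the precisions at relevant ranks, divided by R
    return float((precisions * top_r).sum() / r)
```

A perfect retrieval (all R positives ranked first) scores 1.0, while pushing any positive outside the top-R window lowers the score, unlike R@K, which ignores ranking quality beyond the presence of one hit.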
Similar resources
Cross-Lingual Image Caption Generation
Automatically generating a natural language description of an image is a fundamental problem in artificial intelligence. This task involves both computer vision and natural language processing and is called “image caption generation.” Research on image caption generation has typically focused on taking in an image and generating a caption in English as existing image caption corpora are mostly ...
Topic-Specific Image Caption Generation
Recently, image captioning, which aims to generate a textual description for an image automatically, has attracted researchers from various fields. Encouraging performance has been achieved by applying deep neural networks. Most of these works aim at generating a single caption, which may be incomprehensive, especially for complex images. This paper proposes a topic-specific multi-caption generator, ...
Multimodal Pivots for Image Caption Translation
We present an approach to improve statistical machine translation of image descriptions by multimodal pivots defined in visual space. Image similarity is computed by a convolutional neural network and incorporated into a target-side translation memory retrieval model where descriptions of most similar images are used to rerank translation outputs. Our approach does not depend on the availabilit...
Image Caption Generation with Recursive Neural Networks
The ability to recognize image features and generate accurate, syntactically reasonable text descriptions is important for many tasks in computer vision. Auto-captioning could, for example, be used to provide descriptions of website content, or to generate frame-by-frame descriptions of video for the vision-impaired. In this project, a multimodal architecture for generating image captions is ex...
Deep image representations using caption generators
Deep learning exploits large volumes of labeled data to learn powerful models. When the target dataset is small, it is a common practice to perform transfer learning using pre-trained models to learn new task-specific representations. However, pre-trained CNNs for image recognition are provided with limited information about the image during training, which is the label alone. Tasks such as scene r...
Journal
Journal title: Lecture Notes in Computer Science
سال: 2022
ISSN: ['1611-3349', '0302-9743']
DOI: https://doi.org/10.1007/978-3-031-20074-8_1